1 Introduction


Big data analytics is expensive. Textbooks tell us that a computation problem is tractable if there exists a
polynomial-time algorithm for it, i.e., its cost can be expressed as a polynomial in the size n of the input. However,
this is no longer the case when it comes to big data. When n is big, algorithms in O(n^2) or even O(n) time may take
too long to be practical. Indeed, assuming the largest Solid State Drives (SSD) with 12GB/s for read, a linear scan of
a dataset of 15TB takes more than 20 minutes. It easily takes hours to join tables with millions of tuples. In other
words, many computation problems that are typically considered tractable may become infeasible in the context of big
data.
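
The scan-time figure above follows from dividing the dataset size by the read rate. A minimal sketch of that arithmetic, assuming decimal units (1TB = 10^12 bytes, 1GB = 10^9 bytes) and a sustained sequential read at the quoted rate; the function name scan_seconds is ours, for illustration only:

```python
def scan_seconds(dataset_bytes: float, read_bytes_per_sec: float) -> float:
    """Time for one sequential pass over the data at a fixed read rate."""
    return dataset_bytes / read_bytes_per_sec


TB = 10**12  # decimal terabyte (assumption)
GB = 10**9   # decimal gigabyte (assumption)

seconds = scan_seconds(15 * TB, 12 * GB)
minutes = seconds / 60
print(f"{seconds:.0f} s = {minutes:.1f} min")
```

The point of the sketch is that the cost grows linearly with the dataset size, so even a single pass becomes a bottleneck long before any super-linear operation such as a join is attempted.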


One might be tempted to think that parallel computation could solve the problem, by adding more processors when
needed. However, small businesses often have constrained resources and cannot afford large-scale parallel computation.


Is big data analytics a privilege of big companies? Is it beyond the reach of small companies that can only afford
constrained resources?


We argue that big data analytics is possible under constrained resources. As an example, we consider relational
query answering. Given an SQL query Q and a relational database D, the task is to compute the answers Q(D) to Q in D.
Relational data accounts for the majority of data in industry. Moreover, it is nontrivial to compute Q(D). Indeed, it is
NP-complete to decide whether a given tuple t is in Q(D) when Q is a simple SPC query (selection, projection and
Cartesian product), and PSPACE-complete when Q is in relational algebra, both subsumed by SQL.


We propose BEAS, a new query evaluation paradigm to answer SQL queries under constrained resources. The idea
is to make big data small, i.e., to reduce queries on big data to computation on small data. Underlying BEAS are two
principled approaches: (1) bounded evaluation, which computes exact answers by accessing a bounded amount of data
when possible, and (2) a data-driven approximation scheme, which answers queries for which exact answers are beyond
reach under bounded resources and offers a deterministic accuracy bound.


As proof of concept, we have developed a prototype system and evaluated it with call detail record (CDR)
queries at Huawei. We find that bounded evaluation improves the performance of CDR queries by orders of magnitude,
and is able to reduce big datasets from PB (10^15 bytes) to GB (10^9 bytes) in many cases.


Below we give a brief introduction to bounded evaluation (Section 2), followed by the data-driven approximation
scheme (Section 3). Putting these together, we present BEAS, our resource-bounded query evaluation framework
(Section 4).


2 Bounded Evaluation 


Given a query Q posed on a big dataset D, bounded evaluation aims to compute Q(D) by accessing only a
bounded subset D_Q of D that includes the information necessary for answering Q in D, instead of the entire D. To
identify D_Q, it makes use of an access schema A, which is a set of access constraints, i.e., a combination of simple
cardinality constraints and their associated indices.


Under access schema A, query Q is boundedly evaluable if for all datasets D that conform to A, there exists a
fraction D_Q ⊆ D such that


• Q(D_Q) = Q(D), i.e., D_Q suffices for computing exact answers Q(D); and


• the time for identifying D_Q and hence the size |D_Q| of D_Q are determined by Q and A only. That is, the cost of
computing Q(D_Q) is independent of |D|.


Intuitively, Q(D) can be computed by accessing D_Q. We identify D_Q by reasoning about the cardinality constraints in
A, and fetch it by using the indices in A.
Example 1: Consider a database schema R consisting of three relations:
(a) person(pid, city), stating that pid lives in city,
(b) friend(pid, fid), saying that fid is a friend of pid, and
(c) poi(address, type, city, price), for the type, price and city of points of interest.
An example access schema A consists of the following two access constraints:
• φ_1: friend(pid → fid, 5000),
• φ_2: person(pid → city, 1).
Here φ_1 is a constraint imposed by Facebook: a limit of 5000 friends per person; and φ_2 states that each person
lives in at most one city. An index is built for φ_1 such that given a pid, it returns all fids of pid from friend, i.e.,
φ_1 includes the cardinality constraint (5000 fids for each pid) and the index; similarly for φ_2.
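
To make the notion of an access constraint concrete, the following is a minimal in-memory sketch, not the system's implementation. It assumes that a constraint of the form R(X → Y, N) is stored as a hash index from X-values to sets of Y-values; the class AccessConstraint and its methods build, conforms and fetch are hypothetical names introduced here for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class AccessConstraint:
    """R(X -> Y, N): each X-value has at most N distinct Y-values, plus an index on X."""
    relation: str
    x: tuple
    y: tuple
    bound: int
    index: dict = field(default_factory=lambda: defaultdict(set))

    def build(self, rows):
        """Populate the index from rows given as dicts (attribute name -> value)."""
        for row in rows:
            key = tuple(row[a] for a in self.x)
            self.index[key].add(tuple(row[a] for a in self.y))

    def conforms(self) -> bool:
        """True if no X-value exceeds the cardinality bound."""
        return all(len(ys) <= self.bound for ys in self.index.values())

    def fetch(self, *x_value):
        """Return the Y-values recorded for one X-value (empty set if unknown)."""
        return self.index.get(tuple(x_value), set())


# Toy instance of relation friend (illustrative data only).
friend = [
    {"pid": "p0", "fid": "p1"},
    {"pid": "p0", "fid": "p2"},
    {"pid": "p1", "fid": "p0"},
]
phi1 = AccessConstraint("friend", ("pid",), ("fid",), 5000)
phi1.build(friend)
print(phi1.conforms(), sorted(phi1.fetch("p0")))
```

Keeping the bound and the index in one object mirrors the text: the bound tells a planner how much data a lookup can return, and the index is what makes the lookup possible without scanning the relation.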


Consider a query Q_1 to find the cities where my friends live, which is taken from Graph Search of Facebook.
Written in SQL, Q_1 can be expressed as:


select p.city
from friend as f, person as p
where f.pid = p0 and f.fid = p.pid,


where p0 indicates “me”. When an instance D_0 of R is “big”, e.g., Facebook has billions of users and trillions of
friend links, it is costly to compute Q_1(D_0).

However, we can do better since Q_1 is boundedly evaluable under A: (a) we first identify and fetch at most 5000 fids
for person p0 from relation friend by using the index for φ_1, and (b) for each fid fetched, we get her city by fetching 1
tuple from relation person via the index for φ_2. In total we fetch a set D_Q of at most 10,000 tuples, instead of
trillions; it suffices to compute Q_1(D_0) by using D_Q only.
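
Steps (a) and (b) can be sketched as follows, assuming the two indices are available as in-memory dictionaries. The function answer_q1, the dictionary layout and the toy data (including the city names) are illustrative assumptions, not part of the original system. The counter makes the bound explicit: the number of tuples accessed depends on the constraints, not on the size of the relations.

```python
def answer_q1(p0, friend_index, person_index):
    """Find the cities where p0's friends live, using the two indices only.

    friend_index: pid -> list of fids (at most 5000 per pid, by the first constraint)
    person_index: pid -> city         (at most 1 per pid, by the second constraint)
    Returns the set of cities and the number of tuples accessed.
    """
    accessed = 0
    fids = friend_index.get(p0, [])
    accessed += len(fids)          # step (a): at most 5000 friend tuples
    cities = set()
    for fid in fids:
        city = person_index.get(fid)
        if city is not None:
            accessed += 1          # step (b): 1 person tuple per fid
            cities.add(city)
    return cities, accessed


# Toy data (illustrative only).
friend_index = {"p0": ["p1", "p2", "p3"], "p1": ["p0"]}
person_index = {"p0": "Edinburgh", "p1": "Beijing", "p2": "Shenzhen", "p3": "Beijing"}

cities, accessed = answer_q1("p0", friend_index, person_index)
print(sorted(cities), accessed)
```

With real data the same plan touches at most 5000 + 5000 tuples however many other users the relations contain, which is what makes the query boundedly evaluable.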


The notion of access schema and the foundation of bounded evaluation were established in earlier work.
The theory of bounded evaluation was first evaluated using SPC queries and then extended to relational algebra (RA).
A challenge is that it is undecidable whether an SQL query is boundedly evaluable under an access schema A. To cope
with this, an effective syntax L was developed for boundedly evaluable RA queries. That is, L is a class of RA queries
such that under A,

(a) an RA query Q is boundedly evaluable if and only if it is equivalent to a query Q' in L; and

(b) it takes PTIME (polynomial time) in the size |Q| of Q and size |A| of A to check whether Q' is in L, reducing the
problem to syntactic checking.

That is, L identifies the core subclass of boundedly evaluable RA queries, without sacrificing their expressive power.
This is analogous to the study of safe relational calculus queries, for which deciding safety is also undecidable.
Based on the effective syntax, we can efficiently check whether a query Q is boundedly evaluable, and if so, generate a
query plan to answer Q by accessing a bounded fraction D_Q of D.

It has been shown that bounded evaluation can be readily built on top of commercial DBMS (database management
systems) such as MySQL and PostgreSQL. It extends the DBMS with an immediate capability of querying big relations
under constrained resources. A prototype system has been developed. We find that about 77% of SPC queries and
67% of SQL queries are boundedly evaluable. Better yet, more than 90% of the CDR queries at Huawei are boundedly
evaluable.


3 Data Driven Approximation 


For queries Q that are not boundedly evaluable, can we evaluate Q against a big dataset D under constrained
resources? We answer the question in the affirmative.

We propose a data-driven scheme for approximate query answering. It is parameterized with a resource ratio α ∈
(0, 1], indicating that our available resources can only access an α-fraction of big dataset D. Given α, D and a query Q
over D, it identifies D_Q ⊆ D, and computes Q(D_Q) and an accuracy bound η ∈ (0, 1] such that

(1) |D_Q| ≤ α|D|, where |D_Q| is measured in its number of tuples; and

(2) accuracy(Q(D_Q), Q, D) ≥ η.

Intuitively, it computes approximate answers Q(D_Q) by accessing at most α|D| tuples in the entire process. Thus it
can scale with D when D grows big by setting α small. Moreover, Q(D_Q) assures a deterministic accuracy bound η:

(a) for each approximate answer s ∈ Q(D_Q), there exists an exact answer t ∈ Q(D) that is η-close to s, i.e., s is within
distance η of t, and

(b) for each exact answer t ∈ Q(D), there exists an approximate answer s ∈ Q(D_Q) that is η-close to t.

That is, Q(D_Q) includes only “relevant” answers, and “covers” all exact answers. It finds sensible answers in users'
interest, and suffices for exploratory queries, e.g., real-time problem diagnosis on logs.
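
The two conditions, relevance (a) and coverage (b), can be checked mechanically once a distance between answers is fixed. The sketch below is only an illustration of the definition: the text does not specify the distance function, so abs_diff on single normalized numbers is an assumption, as are the names meets_bound and threshold and the toy values.

```python
def meets_bound(approx, exact, threshold, dist):
    """Return True if both conditions hold for the given distance threshold.

    (a) relevance: every approximate answer is close to some exact answer.
    (b) coverage:  every exact answer is close to some approximate answer.
    """
    relevant = all(any(dist(s, t) <= threshold for t in exact) for s in approx)
    covered = all(any(dist(s, t) <= threshold for s in approx) for t in exact)
    return relevant and covered


def abs_diff(s, t):
    """Assumed distance: absolute difference of two normalized numbers."""
    return abs(s - t)


# Toy answers, assumed to be normalized to [0, 1].
exact = [0.90, 0.95, 0.80]
approx = [0.91, 0.79]

loose = meets_bound(approx, exact, 0.05, abs_diff)
tight = meets_bound(approx, exact, 0.03, abs_diff)
print(loose, tight)
```

The check is symmetric on purpose: dropping condition (a) would admit irrelevant answers, while dropping condition (b) would allow exact answers to go unrepresented.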


The objective is ambitious. As observed in [16], approximate query answering is challenging. Previous approaches
often adopt a one-size-fits-all synopsis of D and answer all queries Q posed on D against that synopsis. The approaches
“substantially limit the types of queries they can execute” [15], and often focus on aggregate queries (max, min, avg,
sum, count). Moreover, they make various assumptions on future queries, i.e., on workloads, query predicates or QCSs
(query column sets), e.g., that “the frequency of columns used for grouping and filtering does not change over time”.
Worse still, they provide either no accuracy guarantee at all, or probabilistic error rates for aggregate queries only.
Such error rates do not tell us how “good” each approximate answer is.

Nonetheless, data-driven approximation is feasible under an access schema.

Example 2: Continuing with Example 1, consider query Q_2 to find me hotels that cost at most $95 per night and are
in a city where one of my friends lives:

select h.address, h.price
from poi as h, friend as f, person as p
where f.pid = p0 and f.fid = p.pid and p.city = h.city and h.type = 'hotel' and h.price <= 95

We can answer Q_2 in a big dataset D_0 of trillions of tuples given a small α, e.g., 10^-4, i.e., when our available
resources can afford to access at most 10^-4 × |D_0| tuples. This is doable by using an access schema A_0, which
includes φ_1 and φ_2 of Example 1 and in addition, the following extended access constraints:


• ψ_0: poi({type, city} → {price, address}, 1, (e_p^0, e_a^0)),


• ψ_m: poi({type, city} → {price, address}, 2^m, (e_p^m, e_a^m)), where m = ⌈log_2 M⌉.

Here M is the maximum number of distinct poi tuples in D_0 grouped by (type, city). We build an index for each ψ_i in
A_0 such that for i ∈ [1, m], given any (type, city)-value (c_t, c_c), we can retrieve a set T of at most 2^i (price, address)
values from D_0 by using the index for ψ_i; moreover, for each poi tuple (c'_a, c_t, c_c, c'_p) in D_0, there exists
(c_p, c_a) ∈ T such that the (price, address)-value (c'_p, c'_a) differs from (c_p, c_a) by distance at most (e_p^i, e_a^i).
That is, T represents the (price, address) values that correspond to (c_t, c_c) with at most 2^i tuples, subject to
distances (e_p^i, e_a^i). Intuitively, the indices give a hierarchical representation of relation poi with different
resolutions i ∈ [1, m]. The higher the resolution i is, the smaller the distance (e_p^i, e_a^i) is, and the more
accurately the index for ψ_i represents D_0.
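
The idea of a hierarchy of resolutions can be illustrated with a small sketch. The text does not say how the indices are constructed, so everything below is an assumption made for illustration: it handles one numeric attribute (price) only, builds each level by recursively halving the sorted values, and uses the middle element of each chunk as its representative. The function name build_level and the sample prices are hypothetical.

```python
def build_level(sorted_values, i):
    """Return (representatives, error) for resolution i.

    At most 2**i representatives are chosen, and every input value lies within
    `error` of some representative. `sorted_values` must be sorted and non-empty.
    """
    chunks = [sorted_values]
    for _ in range(i):                 # halve every chunk once per level
        halved = []
        for chunk in chunks:
            if len(chunk) > 1:
                mid = len(chunk) // 2
                halved += [chunk[:mid], chunk[mid:]]
            else:
                halved.append(chunk)
        chunks = halved
    reps, error = [], 0
    for chunk in chunks:
        rep = chunk[len(chunk) // 2]   # middle element represents the chunk
        reps.append(rep)
        error = max(error, rep - chunk[0], chunk[-1] - rep)
    return reps, error


# Hypothetical hotel prices for one (type, city) group.
prices = sorted([60, 72, 75, 88, 90, 95, 99, 120])
levels = [build_level(prices, i) for i in range(4)]
for i, (reps, error) in enumerate(levels):
    print(i, reps, error)
```

In this construction each level refines the previous one, so the error never increases with the resolution and reaches zero once every value is its own representative. A real index would also have to cover the address attribute and its own distance.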

Assume α|D_0| > 10000, as in Facebook dataset D_0. Then under A_0, we can find hotels by accessing at most α|D_0|
tuples as follows: (a) fetch a set T_1 of fids of p0 by accessing at most 5000 friend tuples using φ_1; (b) for each fid
in T_1, fetch 1 associated city with φ_2, yielding a set T_2 of at most 5000 city values; (c) for each city c in T_2,
fetch at most 2^{k_α} (price, address) pairs corresponding to (“hotel”, c) by using ψ_{k_α}, where
k_α = ⌊log_2(α|D_0| − 10000)⌋; and (d) return a set S of those (price, address) values with price at most
95 + e_p^{k_α}, as approximate answers to Q_2 in D_0. The process accesses at most 5000 + 5000 + 2^{k_α} ≤ α|D_0|
tuples in total.
The set S of answers is accurate: (1) for each hotel h(c_p, c_a) in the exact answers Q_2(D_0), there exists
(c'_p, c'_a) in S that is within e_p^{k_α} and e_a^{k_α} of c_p and c_a, respectively; and (2) for each hotel
h'(c'_p, c'_a) in S, its price c'_p exceeds 95 by at most e_p^{k_α}, e.g., e_p^{k_α} = 4 and c'_p = 99, and c'_a is
the address of hotel h'. Moreover, the larger α is, the smaller e_p^{k_α} and e_a^{k_α} are, and the more accurate S is.
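
The choice of resolution in Example 2 is a small budget calculation: after spending 5000 + 5000 tuples on friends and cities, the rest of the budget determines the largest usable power of two. The sketch below follows the accounting of the text, which charges 2^k tuples in total for step (c). The function name resolution_for_budget is hypothetical, and the concrete values (10^12 tuples and a ratio of 10^-4) are assumed purely for illustration.

```python
def resolution_for_budget(budget, fixed_cost=10_000):
    """Return (k, total): the largest k with fixed_cost + 2**k <= budget.

    budget     -- the resource ratio times |D0|, as a whole number of tuples
    fixed_cost -- tuples spent in steps (a) and (b): 5000 + 5000
    """
    remaining = budget - fixed_cost
    if remaining < 1:
        raise ValueError("budget must exceed the cost of steps (a) and (b)")
    k = remaining.bit_length() - 1   # floor(log2(remaining)), exact for integers
    return k, fixed_cost + 2**k


# Assumed for illustration: |D0| = 10**12 tuples and a resource ratio of 10**-4.
budget = 10**12 // 10**4
k, total = resolution_for_budget(budget)
print(k, total, total <= budget)
```

Integer arithmetic is used instead of a floating-point logarithm so that the floor is exact at powers of two. A larger budget yields a larger k, hence a finer resolution and smaller distances.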

It has been shown that for any dataset D, there exists an access schema A such that D conforms to A, and for
any resource ratio α ∈ (0, 1] and SQL query Q over D, aggregate or not, there exists a dataset D_Q ⊆ D identified by
reasoning about A and a deterministic accuracy bound η such that |D_Q| ≤ α|D| and accuracy(Q(D_Q), Q, D) ≥ η.
Moreover, the larger α is, the higher η is.

As opposed to previous approximate query answering approaches, the data-driven approximation scheme is able to
answer SQL queries Q that are (1) unpredictable, i.e., without assuming any prior knowledge about Q, and (2) generic,
aggregate or not, with (3) deterministic accuracy η in terms of both relevance and coverage.


We find that the approximation scheme computes approximate answers to SQL queries, aggregate or not, with
accuracy η > 0.82, even when α is as small as 5.5 × 10^-4. That is, it reduces D of PB size to D_Q of 550GB.


4 A Resource Bounded Query Evaluation Framework 


We are now ready to present BEAS (Boundedly EvAluable Sql), a resource-bounded framework for querying big
relations. For a big dataset D in an application, BEAS takes a resource ratio α ∈ (0, 1] as a parameter, and discovers an
access schema A. Given an SQL query Q posed on D, BEAS works as follows:


(1) it checks whether Q is boundedly evaluable under A, i.e., whether exact answers Q(D) can be computed by accessing
D_Q ⊆ D such that |D_Q| is independent of |D|;


(2) if so, it computes Q(D) by accessing a bounded fraction D_Q of D;


(3) otherwise, BEAS identifies D_Q with |D_Q| ≤ α|D|, and computes Q(D_Q) with a deterministic accuracy bound η,
based on data-driven approximation.


That is, under the resource constraint α, BEAS computes exact answers Q(D) when possible, and approximate
answers Q(D_Q) otherwise with accuracy η.
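
The three steps amount to a simple dispatch between an exact plan and an approximate one. The following sketch shows only that control flow. The three callables stand in for components that the text describes but does not specify, so their names and signatures (is_bounded, run_bounded, run_approx) and the Answer record are hypothetical.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Answer:
    rows: Any
    exact: bool
    eta: float   # accuracy bound; 1.0 when the answers are exact


def beas_answer(query, alpha, is_bounded, run_bounded, run_approx):
    """Dispatch a query following steps (1)-(3).

    is_bounded(query) -> bool                 step (1), hypothetical checker
    run_bounded(query) -> rows                step (2), hypothetical exact plan
    run_approx(query, alpha) -> (rows, eta)   step (3), hypothetical approximation
    """
    if is_bounded(query):
        return Answer(run_bounded(query), exact=True, eta=1.0)
    rows, eta = run_approx(query, alpha)
    return Answer(rows, exact=False, eta=eta)


# Stub components, for illustration only.
def stub_is_bounded(q):
    return q == "Q1"

def stub_run_bounded(q):
    return ["Beijing", "Shenzhen"]

def stub_run_approx(q, alpha):
    return [("addr-1", 99)], 0.9

a1 = beas_answer("Q1", 1e-4, stub_is_bounded, stub_run_bounded, stub_run_approx)
a2 = beas_answer("Q2", 1e-4, stub_is_bounded, stub_run_bounded, stub_run_approx)
print(a1.exact, a1.eta, a2.exact, a2.eta)
```

Passing the components in as functions keeps the sketch independent of any particular DBMS, in line with the claim that the framework can be layered on top of an existing system.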


As opposed to conventional DBMS, BEAS is unique in its ability to (1) comply with resource ratio α, i.e., it can
scale with arbitrarily large datasets D by adjusting α based on available resources, (2) decide whether Q is boundedly
evaluable, (3) answer unpredictable and generic SQL queries, aggregate or not, with deterministic accuracy η, and (4)
be plugged into commercial DBMS and provide the DBMS with an immediate capability to query big relations under
constrained resources.


In light of these, BEAS is promising for providing small companies with the capability of big data analytics and
hence, enabling them to benefit from big data services. It can also help big companies such as Huawei to reduce cost
and improve efficiency. Moreover, parallel processing is not a silver bullet for big data analytics. Indeed, one might
expect a parallel algorithm to be parallel scalable, i.e., the algorithm would run faster given more processors.
However, few algorithms in the literature have this performance guarantee. Worse yet, some computation problems are
not parallel scalable, i.e., there exist no algorithms for them such that their running time can be substantially
reduced by adding processors, no matter how many processors are used [17-18]. For such computation problems, bounded
evaluation and data-driven approximation offer a feasible solution.


The idea of resource-bounded query answering is not limited to relations. It has been shown that bounded evaluation
improves the performance of graph pattern matching via subgraph isomorphism, an intractable problem that is widely
used in social media marketing and knowledge base expansion [19-20], by 4 orders of magnitude on average. For
personalized social search via subgraph isomorphism, data-driven approximation retains 100% accuracy (i.e., η = 1)
when α is as small as 1.5 × 10^-6 [23], i.e., when processing graphs G of 1PB, it accesses only 15GB of data, reducing
G from PB to GB while retaining high accuracy.